Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery. https://machinelearningmastery.com/

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Online News Popularity dataset presents a regression problem where we are trying to predict the value of a continuous variable.

INTRODUCTION: This dataset summarizes a heterogeneous set of features about articles published by Mashable in a period of two years. The goal is to predict the article’s popularity level in social networks. The dataset does not contain the original content, but some statistics associated with it. The original content can be publicly accessed and retrieved using the provided URLs.

Many thanks to K. Fernandes, P. Vinagre and P. Cortez. A Proactive Intelligent Decision Support System for Predicting the Popularity of Online News. Proceedings of the 17th EPIA 2015 - Portuguese Conference on Artificial Intelligence, September, Coimbra, Portugal, for making the dataset and benchmarking information available.

ANALYSIS: The baseline performance of the machine learning algorithms achieved an average RMSE of 10446. Two algorithms (Random Forest and Stochastic Gradient Boosting) achieved the top RMSE scores after the first round of modeling. After a series of tuning trials, Random Forest turned in the top result on the training data, with a best RMSE of 10299. Using the optimized tuning parameters, the Random Forest algorithm processed the validation dataset with an RMSE of 12978, which was worse than the training RMSE and possibly a sign of over-fitting.
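The validation step described above can be sketched in caret as follows; `fit.rf` is assumed here to be the tuned Random Forest model and `xy_test` the 30% hold-out set created later in the script.

```r
# A sketch of scoring the hold-out set (assumes fit.rf is the tuned
# Random Forest model and xy_test the validation dataframe)
predictions <- predict(fit.rf, newdata = xy_test)

# caret's postResample() reports RMSE, R-squared, and MAE in one call...
postResample(pred = predictions, obs = xy_test$targetVar)

# ...where the RMSE component matches the manual formula:
sqrt(mean((predictions - xy_test$targetVar)^2))
```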

CONCLUSION: For this iteration, the Random Forest algorithm achieved the top training and validation results compared to the other machine learning algorithms. For this dataset, Random Forest should be considered for further modeling or production use.

Dataset Used: Online News Popularity Dataset

Dataset ML Model: Regression with numerical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Online+News+Popularity

The project aims to touch on the following areas:

  1. Document a predictive modeling problem end-to-end.
  2. Explore data cleaning and transformation options.
  3. Explore non-ensemble and ensemble algorithms for baseline model performance.
  4. Explore algorithm tuning techniques for improving model performance.

Any predictive modeling machine learning project generally can be broken down into about six major tasks:

  1. Prepare Problem
  2. Summarize Data
  3. Prepare Data
  4. Model and Evaluate Algorithms
  5. Improve Accuracy or Results
  6. Finalize Model and Present Results

1. Prepare Problem

1.a) Load libraries

startTimeScript <- proc.time()
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(corrplot)
## corrplot 0.84 loaded
library(parallel)
library(mailR)

# Create one random seed number for reproducible results
seedNum <- 888
set.seed(seedNum)

1.b) Load dataset

originalDataset <- read.csv("OnlineNewsPopularity.csv", header = TRUE)

# Dropping the two non-predictive attributes: url and timedelta
originalDataset$url <- NULL
originalDataset$timedelta <- NULL

# Different ways of reading and processing the input dataset. Saving these for future reference.
#x_train <- read.fwf("X_train.txt", widths = widthVector, col.names = colNames)
#y_train <- read.csv("y_train.txt", header = FALSE, col.names = c("targetVar"))
#y_train$targetVar <- as.factor(y_train$targetVar)
#xy_train <- cbind(x_train, y_train)
# Use variable totCol to hold the number of columns in the dataframe
totCol <- ncol(originalDataset)

# Set up variable totAttr for the total number of attribute columns
totAttr <- totCol-1
# targetCol indicates the column location of the target/class variable
# If the target is the first column, set targetCol to 1. If it is the last column, set targetCol to totCol
# If targetCol is neither 1 nor totCol, be careful when slicing up the dataframes for visualization!
targetCol <- totCol
colnames(originalDataset)[targetCol] <- "targetVar"
# We create training datasets (xy_train, x_train, y_train) for various operations.
# We create validation datasets (xy_test, x_test, y_test) for various operations.
set.seed(seedNum)

# Create a list of the rows in the original dataset we can use for training
training_index <- createDataPartition(originalDataset$targetVar, p=0.70, list=FALSE)
# Use 70% of the data to train the models and the remaining for testing/validation
xy_train <- originalDataset[training_index,]
xy_test <- originalDataset[-training_index,]

if (targetCol == 1) {
  x_train <- xy_train[, (targetCol+1):totCol]
  y_train <- xy_train[, targetCol]
  y_test <- xy_test[, targetCol]
} else {
  x_train <- xy_train[, 1:totAttr]
  y_train <- xy_train[, totCol]
  y_test <- xy_test[, totCol]
}
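A quick sanity check on the partition (optional, not part of the original template) can confirm that the 70/30 split is consistent:

```r
# Confirm the split covers every row of the original dataset exactly once
stopifnot(nrow(xy_train) + nrow(xy_test) == nrow(originalDataset))

# Proportion of rows assigned to training; should be close to 0.70
nrow(xy_train) / nrow(originalDataset)

# createDataPartition() stratifies on the target, so the distribution
# of targetVar in the two sets should look similar
summary(xy_train$targetVar)
summary(xy_test$targetVar)
```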

1.c) Set up the key parameters to be used in the script

# Set up the number of rows and columns for the visualization display. dispRow * dispCol should be >= totAttr
dispCol <- 4
if (totAttr %% dispCol == 0) {
  dispRow <- totAttr %/% dispCol
} else {
  dispRow <- (totAttr %/% dispCol) + 1
}
cat("Will attempt to create graphics grid (col x row): ", dispCol, ' by ', dispRow)
## Will attempt to create graphics grid (col x row):  4  by  15
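The if/else arithmetic above can also be written as a single `ceiling()` call, shown here as an equivalent alternative:

```r
# Round the attribute count up to the next whole grid row;
# equivalent to the %% / %/% branch above
dispRow <- ceiling(totAttr / dispCol)
# With totAttr = 58 and dispCol = 4, this yields 15, matching the output above
```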

1.d) Set test options and evaluation metric

# Run algorithms using 10-fold cross validation
control <- trainControl(method="repeatedcv", number=10, repeats=1)
metricTarget <- "RMSE"

1.e) Set up the email notification function

email_notify <- function(msg=""){
  sender <- "luozhi2488@gmail.com"
  receiver <- "dave@contactdavidlowe.com"
  sbj_line <- "Notification from R Script"
  password <- readLines("email_credential.txt")
  send.mail(
    from = sender,
    to = receiver,
    subject= sbj_line,
    body = msg,
    smtp = list(host.name = "smtp.gmail.com", port = 465, user.name = sender, passwd = password, ssl = TRUE),
    authenticate = TRUE,
    send = TRUE)
}
email_notify(paste("Library and Data Loading Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@47fd17e3}"

2. Summarize Data

To gain a better understanding of the data that we have on-hand, we will leverage a number of descriptive statistics and data visualization techniques. The plan is to use the results to consider new questions, review assumptions, and validate hypotheses that we can investigate later with specialized models.

2.a) Descriptive statistics

2.a.i) Peek at the data itself.

head(xy_train)
##   n_tokens_title n_tokens_content n_unique_tokens n_non_stop_words
## 2              9              255       0.6047431                1
## 3              9              211       0.5751295                1
## 5             13             1072       0.4156456                1
## 6             10              370       0.5598886                1
## 7              8              960       0.4181626                1
## 8             12              989       0.4335736                1
##   n_non_stop_unique_tokens num_hrefs num_self_hrefs num_imgs num_videos
## 2                0.7919463         3              1        1          0
## 3                0.6638655         3              1        1          0
## 5                0.5408895        19             19       20          0
## 6                0.6981982         2              2        0          0
## 7                0.5498339        21             20       20          0
## 8                0.5721078        20             20       20          0
##   average_token_length num_keywords data_channel_is_lifestyle
## 2             4.913725            4                         0
## 3             4.393365            6                         0
## 5             4.682836            7                         0
## 6             4.359459            9                         0
## 7             4.654167           10                         1
## 8             4.617796            9                         0
##   data_channel_is_entertainment data_channel_is_bus data_channel_is_socmed
## 2                             0                   1                      0
## 3                             0                   1                      0
## 5                             0                   0                      0
## 6                             0                   0                      0
## 7                             0                   0                      0
## 8                             0                   0                      0
##   data_channel_is_tech data_channel_is_world kw_min_min kw_max_min
## 2                    0                     0          0          0
## 3                    0                     0          0          0
## 5                    1                     0          0          0
## 6                    1                     0          0          0
## 7                    0                     0          0          0
## 8                    1                     0          0          0
##   kw_avg_min kw_min_max kw_max_max kw_avg_max kw_min_avg kw_max_avg
## 2          0          0          0          0          0          0
## 3          0          0          0          0          0          0
## 5          0          0          0          0          0          0
## 6          0          0          0          0          0          0
## 7          0          0          0          0          0          0
## 8          0          0          0          0          0          0
##   kw_avg_avg self_reference_min_shares self_reference_max_shares
## 2          0                         0                         0
## 3          0                       918                       918
## 5          0                       545                     16000
## 6          0                      8500                      8500
## 7          0                       545                     16000
## 8          0                       545                     16000
##   self_reference_avg_sharess weekday_is_monday weekday_is_tuesday
## 2                      0.000                 1                  0
## 3                    918.000                 1                  0
## 5                   3151.158                 1                  0
## 6                   8500.000                 1                  0
## 7                   3151.158                 1                  0
## 8                   3151.158                 1                  0
##   weekday_is_wednesday weekday_is_thursday weekday_is_friday
## 2                    0                   0                 0
## 3                    0                   0                 0
## 5                    0                   0                 0
## 6                    0                   0                 0
## 7                    0                   0                 0
## 8                    0                   0                 0
##   weekday_is_saturday weekday_is_sunday is_weekend     LDA_00     LDA_01
## 2                   0                 0          0 0.79975569 0.05004668
## 3                   0                 0          0 0.21779229 0.03333446
## 5                   0                 0          0 0.02863281 0.02879355
## 6                   0                 0          0 0.02224528 0.30671758
## 7                   0                 0          0 0.02008167 0.11470539
## 8                   0                 0          0 0.02222436 0.15073297
##       LDA_02     LDA_03     LDA_04 global_subjectivity
## 2 0.05009625 0.05010067 0.05000071           0.3412458
## 3 0.03335142 0.03333354 0.68218829           0.7022222
## 5 0.02857518 0.02857168 0.88542678           0.5135021
## 6 0.02223128 0.02222429 0.62658158           0.4374086
## 7 0.02002437 0.02001533 0.82517325           0.5144803
## 8 0.24343548 0.02222360 0.56138359           0.5434742
##   global_sentiment_polarity global_rate_positive_words
## 2                0.14894781                 0.04313725
## 3                0.32333333                 0.05687204
## 5                0.28100348                 0.07462687
## 6                0.07118419                 0.02972973
## 7                0.26830272                 0.08020833
## 8                0.29861347                 0.08392315
##   global_rate_negative_words rate_positive_words rate_negative_words
## 2                0.015686275           0.7333333           0.2666667
## 3                0.009478673           0.8571429           0.1428571
## 5                0.012126866           0.8602151           0.1397849
## 6                0.027027027           0.5238095           0.4761905
## 7                0.016666667           0.8279570           0.1720430
## 8                0.015166835           0.8469388           0.1530612
##   avg_positive_polarity min_positive_polarity max_positive_polarity
## 2             0.2869146            0.03333333                   0.7
## 3             0.4958333            0.10000000                   1.0
## 5             0.4111274            0.03333333                   1.0
## 6             0.3506100            0.13636364                   0.6
## 7             0.4020386            0.10000000                   1.0
## 8             0.4277205            0.10000000                   1.0
##   avg_negative_polarity min_negative_polarity max_negative_polarity
## 2            -0.1187500                -0.125            -0.1000000
## 3            -0.4666667                -0.800            -0.1333333
## 5            -0.2201923                -0.500            -0.0500000
## 6            -0.1950000                -0.400            -0.1000000
## 7            -0.2244792                -0.500            -0.0500000
## 8            -0.2427778                -0.500            -0.0500000
##   title_subjectivity title_sentiment_polarity abs_title_subjectivity
## 2          0.0000000                0.0000000             0.50000000
## 3          0.0000000                0.0000000             0.50000000
## 5          0.4545455                0.1363636             0.04545455
## 6          0.6428571                0.2142857             0.14285714
## 7          0.0000000                0.0000000             0.50000000
## 8          1.0000000                0.5000000             0.50000000
##   abs_title_sentiment_polarity targetVar
## 2                    0.0000000       711
## 3                    0.0000000      1500
## 5                    0.1363636       505
## 6                    0.2142857       855
## 7                    0.0000000       556
## 8                    0.5000000       891

2.a.ii) Dimensions of the dataset.

dim(xy_train)
## [1] 27752    59
dim(xy_test)
## [1] 11892    59

2.a.iii) Types of the attributes.

sapply(xy_train, class)
##                n_tokens_title              n_tokens_content 
##                     "numeric"                     "numeric" 
##               n_unique_tokens              n_non_stop_words 
##                     "numeric"                     "numeric" 
##      n_non_stop_unique_tokens                     num_hrefs 
##                     "numeric"                     "numeric" 
##                num_self_hrefs                      num_imgs 
##                     "numeric"                     "numeric" 
##                    num_videos          average_token_length 
##                     "numeric"                     "numeric" 
##                  num_keywords     data_channel_is_lifestyle 
##                     "numeric"                     "numeric" 
## data_channel_is_entertainment           data_channel_is_bus 
##                     "numeric"                     "numeric" 
##        data_channel_is_socmed          data_channel_is_tech 
##                     "numeric"                     "numeric" 
##         data_channel_is_world                    kw_min_min 
##                     "numeric"                     "numeric" 
##                    kw_max_min                    kw_avg_min 
##                     "numeric"                     "numeric" 
##                    kw_min_max                    kw_max_max 
##                     "numeric"                     "numeric" 
##                    kw_avg_max                    kw_min_avg 
##                     "numeric"                     "numeric" 
##                    kw_max_avg                    kw_avg_avg 
##                     "numeric"                     "numeric" 
##     self_reference_min_shares     self_reference_max_shares 
##                     "numeric"                     "numeric" 
##    self_reference_avg_sharess             weekday_is_monday 
##                     "numeric"                     "numeric" 
##            weekday_is_tuesday          weekday_is_wednesday 
##                     "numeric"                     "numeric" 
##           weekday_is_thursday             weekday_is_friday 
##                     "numeric"                     "numeric" 
##           weekday_is_saturday             weekday_is_sunday 
##                     "numeric"                     "numeric" 
##                    is_weekend                        LDA_00 
##                     "numeric"                     "numeric" 
##                        LDA_01                        LDA_02 
##                     "numeric"                     "numeric" 
##                        LDA_03                        LDA_04 
##                     "numeric"                     "numeric" 
##           global_subjectivity     global_sentiment_polarity 
##                     "numeric"                     "numeric" 
##    global_rate_positive_words    global_rate_negative_words 
##                     "numeric"                     "numeric" 
##           rate_positive_words           rate_negative_words 
##                     "numeric"                     "numeric" 
##         avg_positive_polarity         min_positive_polarity 
##                     "numeric"                     "numeric" 
##         max_positive_polarity         avg_negative_polarity 
##                     "numeric"                     "numeric" 
##         min_negative_polarity         max_negative_polarity 
##                     "numeric"                     "numeric" 
##            title_subjectivity      title_sentiment_polarity 
##                     "numeric"                     "numeric" 
##        abs_title_subjectivity  abs_title_sentiment_polarity 
##                     "numeric"                     "numeric" 
##                     targetVar 
##                     "integer"

2.a.iv) Statistical summary of all attributes.

summary(xy_train)
##  n_tokens_title n_tokens_content n_unique_tokens    n_non_stop_words  
##  Min.   : 3.0   Min.   :   0.0   Min.   :  0.0000   Min.   :   0.000  
##  1st Qu.: 9.0   1st Qu.: 246.0   1st Qu.:  0.4707   1st Qu.:   1.000  
##  Median :10.0   Median : 409.0   Median :  0.5393   Median :   1.000  
##  Mean   :10.4   Mean   : 547.2   Mean   :  0.5555   Mean   :   1.008  
##  3rd Qu.:12.0   3rd Qu.: 716.0   3rd Qu.:  0.6081   3rd Qu.:   1.000  
##  Max.   :23.0   Max.   :8474.0   Max.   :701.0000   Max.   :1042.000  
##  n_non_stop_unique_tokens   num_hrefs      num_self_hrefs  
##  Min.   :  0.0000         Min.   :  0.00   Min.   : 0.000  
##  1st Qu.:  0.6255         1st Qu.:  4.00   1st Qu.: 1.000  
##  Median :  0.6903         Median :  7.00   Median : 3.000  
##  Mean   :  0.6957         Mean   : 10.88   Mean   : 3.302  
##  3rd Qu.:  0.7542         3rd Qu.: 14.00   3rd Qu.: 4.000  
##  Max.   :650.0000         Max.   :304.00   Max.   :74.000  
##     num_imgs         num_videos     average_token_length  num_keywords   
##  Min.   :  0.000   Min.   : 0.000   Min.   :0.000        Min.   : 1.000  
##  1st Qu.:  1.000   1st Qu.: 0.000   1st Qu.:4.477        1st Qu.: 6.000  
##  Median :  1.000   Median : 0.000   Median :4.662        Median : 7.000  
##  Mean   :  4.563   Mean   : 1.262   Mean   :4.546        Mean   : 7.227  
##  3rd Qu.:  4.000   3rd Qu.: 1.000   3rd Qu.:4.854        3rd Qu.: 9.000  
##  Max.   :111.000   Max.   :91.000   Max.   :6.610        Max.   :10.000  
##  data_channel_is_lifestyle data_channel_is_entertainment
##  Min.   :0.00000           Min.   :0.000                
##  1st Qu.:0.00000           1st Qu.:0.000                
##  Median :0.00000           Median :0.000                
##  Mean   :0.05387           Mean   :0.178                
##  3rd Qu.:0.00000           3rd Qu.:0.000                
##  Max.   :1.00000           Max.   :1.000                
##  data_channel_is_bus data_channel_is_socmed data_channel_is_tech
##  Min.   :0.0000      Min.   :0.00000        Min.   :0.0000      
##  1st Qu.:0.0000      1st Qu.:0.00000        1st Qu.:0.0000      
##  Median :0.0000      Median :0.00000        Median :0.0000      
##  Mean   :0.1579      Mean   :0.05801        Mean   :0.1864      
##  3rd Qu.:0.0000      3rd Qu.:0.00000        3rd Qu.:0.0000      
##  Max.   :1.0000      Max.   :1.00000        Max.   :1.0000      
##  data_channel_is_world   kw_min_min       kw_max_min       kw_avg_min     
##  Min.   :0.0000        Min.   : -1.00   Min.   :     0   Min.   :   -1.0  
##  1st Qu.:0.0000        1st Qu.: -1.00   1st Qu.:   450   1st Qu.:  141.9  
##  Median :0.0000        Median : -1.00   Median :   662   Median :  235.1  
##  Mean   :0.2092        Mean   : 26.13   Mean   :  1159   Mean   :  313.8  
##  3rd Qu.:0.0000        3rd Qu.:  4.00   3rd Qu.:  1000   3rd Qu.:  356.8  
##  Max.   :1.0000        Max.   :377.00   Max.   :298400   Max.   :42827.9  
##    kw_min_max       kw_max_max       kw_avg_max       kw_min_avg  
##  Min.   :     0   Min.   :     0   Min.   :     0   Min.   :  -1  
##  1st Qu.:     0   1st Qu.:843300   1st Qu.:172048   1st Qu.:   0  
##  Median :  1400   Median :843300   Median :245025   Median :1034  
##  Mean   : 13458   Mean   :752066   Mean   :259524   Mean   :1122  
##  3rd Qu.:  7900   3rd Qu.:843300   3rd Qu.:331986   3rd Qu.:2066  
##  Max.   :843300   Max.   :843300   Max.   :843300   Max.   :3613  
##    kw_max_avg       kw_avg_avg    self_reference_min_shares
##  Min.   :     0   Min.   :    0   Min.   :     0           
##  1st Qu.:  3564   1st Qu.: 2386   1st Qu.:   638           
##  Median :  4358   Median : 2870   Median :  1200           
##  Mean   :  5640   Mean   : 3137   Mean   :  4084           
##  3rd Qu.:  6021   3rd Qu.: 3605   3rd Qu.:  2600           
##  Max.   :298400   Max.   :43568   Max.   :843300           
##  self_reference_max_shares self_reference_avg_sharess weekday_is_monday
##  Min.   :     0            Min.   :     0             Min.   :0.0000   
##  1st Qu.:  1100            1st Qu.:   985             1st Qu.:0.0000   
##  Median :  2800            Median :  2200             Median :0.0000   
##  Mean   : 10164            Mean   :  6380             Mean   :0.1689   
##  3rd Qu.:  7900            3rd Qu.:  5100             3rd Qu.:0.0000   
##  Max.   :843300            Max.   :843300             Max.   :1.0000   
##  weekday_is_tuesday weekday_is_wednesday weekday_is_thursday
##  Min.   :0.0000     Min.   :0.0000       Min.   :0.0000     
##  1st Qu.:0.0000     1st Qu.:0.0000       1st Qu.:0.0000     
##  Median :0.0000     Median :0.0000       Median :0.0000     
##  Mean   :0.1865     Mean   :0.1886       Mean   :0.1833     
##  3rd Qu.:0.0000     3rd Qu.:0.0000       3rd Qu.:0.0000     
##  Max.   :1.0000     Max.   :1.0000       Max.   :1.0000     
##  weekday_is_friday weekday_is_saturday weekday_is_sunday   is_weekend    
##  Min.   :0.0000    Min.   :0.00000     Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.0000    1st Qu.:0.00000     1st Qu.:0.00000   1st Qu.:0.0000  
##  Median :0.0000    Median :0.00000     Median :0.00000   Median :0.0000  
##  Mean   :0.1434    Mean   :0.06191     Mean   :0.06735   Mean   :0.1293  
##  3rd Qu.:0.0000    3rd Qu.:0.00000     3rd Qu.:0.00000   3rd Qu.:0.0000  
##  Max.   :1.0000    Max.   :1.00000     Max.   :1.00000   Max.   :1.0000  
##      LDA_00            LDA_01            LDA_02            LDA_03       
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.02505   1st Qu.:0.02501   1st Qu.:0.02857   1st Qu.:0.02857  
##  Median :0.03339   Median :0.03334   Median :0.04000   Median :0.04000  
##  Mean   :0.18415   Mean   :0.14087   Mean   :0.21465   Mean   :0.22515  
##  3rd Qu.:0.24039   3rd Qu.:0.15034   3rd Qu.:0.32802   3rd Qu.:0.38152  
##  Max.   :0.92699   Max.   :0.92595   Max.   :0.92000   Max.   :0.91998  
##      LDA_04        global_subjectivity global_sentiment_polarity
##  Min.   :0.00000   Min.   :0.0000      Min.   :-0.38021         
##  1st Qu.:0.02857   1st Qu.:0.3955      1st Qu.: 0.05712         
##  Median :0.04073   Median :0.4534      Median : 0.11867         
##  Mean   :0.23514   Mean   :0.4430      Mean   : 0.11861         
##  3rd Qu.:0.40359   3rd Qu.:0.5083      3rd Qu.: 0.17700         
##  Max.   :0.92712   Max.   :1.0000      Max.   : 0.65500         
##  global_rate_positive_words global_rate_negative_words rate_positive_words
##  Min.   :0.00000            Min.   :0.000000           Min.   :0.0000     
##  1st Qu.:0.02834            1st Qu.:0.009662           1st Qu.:0.6000     
##  Median :0.03888            Median :0.015326           Median :0.7097     
##  Mean   :0.03955            Mean   :0.016647           Mean   :0.6815     
##  3rd Qu.:0.05025            3rd Qu.:0.021739           3rd Qu.:0.8000     
##  Max.   :0.15217            Max.   :0.184932           Max.   :1.0000     
##  rate_negative_words avg_positive_polarity min_positive_polarity
##  Min.   :0.0000      Min.   :0.0000        Min.   :0.00000      
##  1st Qu.:0.1852      1st Qu.:0.3056        1st Qu.:0.05000      
##  Median :0.2800      Median :0.3583        Median :0.10000      
##  Mean   :0.2884      Mean   :0.3532        Mean   :0.09536      
##  3rd Qu.:0.3846      3rd Qu.:0.4108        3rd Qu.:0.10000      
##  Max.   :1.0000      Max.   :1.0000        Max.   :1.00000      
##  max_positive_polarity avg_negative_polarity min_negative_polarity
##  Min.   :0.0000        Min.   :-1.0000       Min.   :-1.0000      
##  1st Qu.:0.6000        1st Qu.:-0.3282       1st Qu.:-0.7000      
##  Median :0.8000        Median :-0.2536       Median :-0.5000      
##  Mean   :0.7553        Mean   :-0.2596       Mean   :-0.5222      
##  3rd Qu.:1.0000        3rd Qu.:-0.1873       3rd Qu.:-0.3000      
##  Max.   :1.0000        Max.   : 0.0000       Max.   : 0.0000      
##  max_negative_polarity title_subjectivity title_sentiment_polarity
##  Min.   :-1.0000       Min.   :0.0000     Min.   :-1.00000        
##  1st Qu.:-0.1250       1st Qu.:0.0000     1st Qu.: 0.00000        
##  Median :-0.1000       Median :0.1429     Median : 0.00000        
##  Mean   :-0.1073       Mean   :0.2819     Mean   : 0.07093        
##  3rd Qu.:-0.0500       3rd Qu.:0.5000     3rd Qu.: 0.13750        
##  Max.   : 0.0000       Max.   :1.0000     Max.   : 1.00000        
##  abs_title_subjectivity abs_title_sentiment_polarity   targetVar     
##  Min.   :0.0000         Min.   :0.0000               Min.   :     4  
##  1st Qu.:0.1667         1st Qu.:0.0000               1st Qu.:   946  
##  Median :0.5000         Median :0.0000               Median :  1400  
##  Mean   :0.3419         Mean   :0.1558               Mean   :  3366  
##  3rd Qu.:0.5000         3rd Qu.:0.2500               3rd Qu.:  2800  
##  Max.   :0.5000         Max.   :1.0000               Max.   :690400

2.a.v) Count missing values.

sapply(xy_train, function(x) sum(is.na(x)))
##                n_tokens_title              n_tokens_content 
##                             0                             0 
##               n_unique_tokens              n_non_stop_words 
##                             0                             0 
##      n_non_stop_unique_tokens                     num_hrefs 
##                             0                             0 
##                num_self_hrefs                      num_imgs 
##                             0                             0 
##                    num_videos          average_token_length 
##                             0                             0 
##                  num_keywords     data_channel_is_lifestyle 
##                             0                             0 
## data_channel_is_entertainment           data_channel_is_bus 
##                             0                             0 
##        data_channel_is_socmed          data_channel_is_tech 
##                             0                             0 
##         data_channel_is_world                    kw_min_min 
##                             0                             0 
##                    kw_max_min                    kw_avg_min 
##                             0                             0 
##                    kw_min_max                    kw_max_max 
##                             0                             0 
##                    kw_avg_max                    kw_min_avg 
##                             0                             0 
##                    kw_max_avg                    kw_avg_avg 
##                             0                             0 
##     self_reference_min_shares     self_reference_max_shares 
##                             0                             0 
##    self_reference_avg_sharess             weekday_is_monday 
##                             0                             0 
##            weekday_is_tuesday          weekday_is_wednesday 
##                             0                             0 
##           weekday_is_thursday             weekday_is_friday 
##                             0                             0 
##           weekday_is_saturday             weekday_is_sunday 
##                             0                             0 
##                    is_weekend                        LDA_00 
##                             0                             0 
##                        LDA_01                        LDA_02 
##                             0                             0 
##                        LDA_03                        LDA_04 
##                             0                             0 
##           global_subjectivity     global_sentiment_polarity 
##                             0                             0 
##    global_rate_positive_words    global_rate_negative_words 
##                             0                             0 
##           rate_positive_words           rate_negative_words 
##                             0                             0 
##         avg_positive_polarity         min_positive_polarity 
##                             0                             0 
##         max_positive_polarity         avg_negative_polarity 
##                             0                             0 
##         min_negative_polarity         max_negative_polarity 
##                             0                             0 
##            title_subjectivity      title_sentiment_polarity 
##                             0                             0 
##        abs_title_subjectivity  abs_title_sentiment_polarity 
##                             0                             0 
##                     targetVar 
##                             0

2.b) Data visualizations

# Boxplots for each attribute
# par(mfrow=c(dispRow,dispCol))
for(i in 1:totAttr) {
    boxplot(x_train[,i], main=names(x_train)[i])
}

# Histograms for each attribute
# par(mfrow=c(dispRow,dispCol))
for(i in 1:totAttr) {
    hist(x_train[,i], main=names(x_train)[i])
}

# Density plot for each attribute
# par(mfrow=c(dispRow,dispCol))
for(i in 1:totAttr) {
    plot(density(x_train[,i]), main=names(x_train)[i])
}

# Correlation plot
correlations <- cor(x_train)
corrplot(correlations, method="circle")
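One way to act on the correlation plot in a later iteration (not applied here) is caret's `findCorrelation()`, which flags attributes whose pairwise correlation exceeds a cutoff; the 0.90 cutoff below is an arbitrary illustration.

```r
# Flag attribute columns with pairwise correlation above 0.90;
# these are candidates for removal before modeling
highCorr <- findCorrelation(correlations, cutoff = 0.90)
names(x_train)[highCorr]
```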

email_notify(paste("Data Summary and Visualization Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@34340fab}"

3. Prepare Data

Some datasets may require additional preparation activities that best expose the structure of the problem and the relationships between the input attributes and the output variable. The data-prep tasks might include:

3.a) Data Cleaning

# Not applicable for this iteration of the project.

# Mark missing values
#invalid <- 0
#entireDataset$some_col[entireDataset$some_col==invalid] <- NA

# Impute missing values
#entireDataset$some_col <- with(entireDataset, impute(some_col, mean))

3.b) Feature Selection

# Not applicable for this iteration of the project.

3.c) Data Transforms

# Not applicable for this iteration of the project.
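No transforms are applied in this iteration, but caret's `preProcess()` offers a concise way to center, scale, or otherwise transform the attributes when needed. A hypothetical sketch, left commented out to match the template's convention for unused steps (the variable names are illustrative):

```r
# Hypothetical example: standardize the training attributes with caret.
# Not used in this iteration; shown only to illustrate the data-transform step.
#preProcParams <- preProcess(x_train, method=c("center", "scale"))
#x_train_scaled <- predict(preProcParams, x_train)
```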
proc.time()-startTimeScript
##    user  system elapsed 
##  31.287   0.594  35.516
email_notify(paste("Data Cleaning and Transformation Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@546a03af}"

4. Model and Evaluate Algorithms

After the data preparation, we next work on finding a workable model by evaluating a subset of machine learning algorithms that are good at exploiting the structure of the dataset.

For this project, we will evaluate four linear, three non-linear, and three ensemble algorithms:

Linear Algorithms: Linear Regression, Ridge, LASSO, and ElasticNet

Non-Linear Algorithms: Decision Trees (CART), k-Nearest Neighbors, and Support Vector Machine

Ensemble Algorithms: Bagged CART, Random Forest, and Stochastic Gradient Boosting

The random number seed is reset before each run to ensure that each algorithm is evaluated using the same data splits, which makes the results directly comparable.
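The `train()` calls below all rely on a shared `control`, `seedNum`, and `metricTarget` defined earlier in the script. A minimal sketch of what that setup likely looks like, inferred from the printed output (10-fold cross-validation repeated once, RMSE as the ranking metric); the seed value shown is an assumption:

```r
# Assumed resampling harness for all train() calls in this section.
# seedNum's actual value is set earlier in the script; 888 is a placeholder.
library(caret)
seedNum <- 888
metricTarget <- "RMSE"
control <- trainControl(method="repeatedcv", number=10, repeats=1)
```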

4.a) Generate models using linear algorithms

# Linear Regression (Regression)
startTimeModule <- proc.time()
set.seed(seedNum)
fit.lm <- train(targetVar~., data=xy_train, method="lm", metric=metricTarget, trControl=control)
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient
## fit may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient
## fit may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient
## fit may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient
## fit may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient
## fit may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient
## fit may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient
## fit may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient
## fit may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient
## fit may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient
## fit may be misleading
print(fit.lm)
## Linear Regression 
## 
## 27752 samples
##    58 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 24977, 24978, 24977, 24977, 24977, 24976, ... 
## Resampling results:
## 
##   RMSE      Rsquared    MAE     
##   10687.47  0.02589326  3038.825
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
proc.time()-startTimeModule
##    user  system elapsed 
##   5.576   0.091   5.728
email_notify(paste("Linear Regression Modeling Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@2aaf7cc2}"
# Ridge (Regression)
startTimeModule <- proc.time()
set.seed(seedNum)
fit.ridge <- train(targetVar~., data=xy_train, method="ridge", metric=metricTarget, trControl=control)
print(fit.ridge)
## Ridge Regression 
## 
## 27752 samples
##    58 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 24977, 24978, 24977, 24977, 24977, 24976, ... 
## Resampling results across tuning parameters:
## 
##   lambda  RMSE      Rsquared    MAE     
##   0e+00   10687.47  0.02589326  3038.825
##   1e-04   10568.65  0.02597326  3033.107
##   1e-01   10324.02  0.02589728  3013.986
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1.
proc.time()-startTimeModule
##    user  system elapsed 
##  41.450   1.331  43.262
email_notify(paste("Ridge Modeling Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@357246de}"
# lasso (Regression)
startTimeModule <- proc.time()
set.seed(seedNum)
fit.lasso <- train(targetVar~., data=xy_train, method="lasso", metric=metricTarget, trControl=control)
print(fit.lasso)
## The lasso 
## 
## 27752 samples
##    58 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 24977, 24978, 24977, 24977, 24977, 24976, ... 
## Resampling results across tuning parameters:
## 
##   fraction  RMSE      Rsquared    MAE     
##   0.1       10323.76  0.02643750  3021.107
##   0.5       10319.86  0.02691661  3016.310
##   0.9       11035.37  0.02589539  3048.686
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was fraction = 0.5.
proc.time()-startTimeModule
##    user  system elapsed 
##  16.708   0.271  17.165
email_notify(paste("lasso Modeling Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@23223dd8}"
# ElasticNet (Regression)
startTimeModule <- proc.time()
set.seed(seedNum)
fit.en <- train(targetVar~., data=xy_train, method="enet", metric=metricTarget, trControl=control)
print(fit.en)
## Elasticnet 
## 
## 27752 samples
##    58 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 24977, 24978, 24977, 24977, 24977, 24976, ... 
## Resampling results across tuning parameters:
## 
##   lambda  fraction  RMSE      Rsquared    MAE     
##   0e+00   0.050     10326.34  0.02619419  3023.299
##   0e+00   0.525     10330.25  0.02633867  3019.543
##   0e+00   1.000     10687.47  0.02589326  3038.825
##   1e-04   0.050     10348.49  0.02451538  3062.965
##   1e-04   0.525     10323.10  0.02669545  3015.345
##   1e-04   1.000     10568.65  0.02597326  3033.107
##   1e-01   0.050     10390.88  0.02379999  3108.682
##   1e-01   0.525     10334.60  0.02630263  3014.793
##   1e-01   1.000     10324.02  0.02589728  3013.986
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.525 and lambda
##  = 1e-04.
proc.time()-startTimeModule
##    user  system elapsed 
##  40.005   1.542  42.062
email_notify(paste("ElasticNet Modeling Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@1fbc7afb}"

4.b) Generate models using nonlinear algorithms

# Decision Tree - CART (Regression/Classification)
startTimeModule <- proc.time()
set.seed(seedNum)
fit.cart <- train(targetVar~., data=xy_train, method="rpart", metric=metricTarget, trControl=control)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
print(fit.cart)
## CART 
## 
## 27752 samples
##    58 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 24977, 24978, 24977, 24977, 24977, 24976, ... 
## Resampling results across tuning parameters:
## 
##   cp           RMSE      Rsquared     MAE     
##   0.008476391  10634.61  0.008815223  3102.138
##   0.009380481  10549.96  0.010943961  3097.100
##   0.012215379  10420.00  0.006980293  3130.078
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.01221538.
proc.time()-startTimeModule
##    user  system elapsed 
##  20.022   0.156  20.405
email_notify(paste("Decision Tree Modeling Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@c818063}"
# k-Nearest Neighbors (Regression/Classification)
startTimeModule <- proc.time()
set.seed(seedNum)
fit.knn <- train(targetVar~., data=xy_train, method="knn", metric=metricTarget, trControl=control)
print(fit.knn)
## k-Nearest Neighbors 
## 
## 27752 samples
##    58 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 24977, 24978, 24977, 24977, 24977, 24976, ... 
## Resampling results across tuning parameters:
## 
##   k  RMSE      Rsquared     MAE     
##   5  11324.73  0.002064483  3303.791
##   7  11079.79  0.002283724  3243.485
##   9  10934.61  0.002877122  3193.897
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 9.
proc.time()-startTimeModule
##    user  system elapsed 
## 178.069   0.125 180.096
email_notify(paste("k-Nearest Neighbors Modeling Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@129a8472}"
# Support Vector Machine (Regression/Classification)
startTimeModule <- proc.time()
set.seed(seedNum)
fit.svm <- train(targetVar~., data=xy_train, method="svmRadial", metric=metricTarget, trControl=control)
print(fit.svm)
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 27752 samples
##    58 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 24977, 24978, 24977, 24977, 24977, 24976, ... 
## Resampling results across tuning parameters:
## 
##   C     RMSE      Rsquared    MAE     
##   0.25  10423.99  0.02790161  2451.026
##   0.50  10411.88  0.02764526  2461.872
##   1.00  10399.65  0.02622827  2482.555
## 
## Tuning parameter 'sigma' was held constant at a value of 0.01220669
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01220669 and C = 1.
proc.time()-startTimeModule
##     user   system  elapsed 
## 10262.65    10.69 10388.01
email_notify(paste("Support Vector Machine Modeling Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@1d251891}"

4.c) Generate models using ensemble algorithms

In this section, we will explore the use and tuning of ensemble algorithms to see whether we can improve the results.

# Bagged CART (Regression/Classification)
startTimeModule <- proc.time()
set.seed(seedNum)
fit.bagcart <- train(targetVar~., data=xy_train, method="treebag", metric=metricTarget, trControl=control)
print(fit.bagcart)
## Bagged CART 
## 
## 27752 samples
##    58 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 24977, 24978, 24977, 24977, 24977, 24976, ... 
## Resampling results:
## 
##   RMSE      Rsquared    MAE     
##   10437.36  0.01182934  3076.994
proc.time()-startTimeModule
##    user  system elapsed 
## 132.495   0.783 134.717
email_notify(paste("Bagged CART Modeling Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@43a25848}"
# Random Forest (Regression/Classification)
startTimeModule <- proc.time()
set.seed(seedNum)
fit.rf <- train(targetVar~., data=xy_train, method="rf", metric=metricTarget, trControl=control)
print(fit.rf)
## Random Forest 
## 
## 27752 samples
##    58 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 24977, 24978, 24977, 24977, 24977, 24976, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared    MAE     
##    2    10302.77  0.03086183  3080.526
##   30    10628.66  0.01653037  3326.837
##   58    10828.11  0.01285820  3364.917
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 2.
proc.time()-startTimeModule
##      user    system   elapsed 
## 52084.841    21.779 52644.509
email_notify(paste("Random Forest Modeling Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@7f63425a}"
# Stochastic Gradient Boosting (Regression/Classification)
startTimeModule <- proc.time()
set.seed(seedNum)
fit.gbm <- train(targetVar~., data=xy_train, method="gbm", metric=metricTarget, trControl=control, verbose=F)
print(fit.gbm)
## Stochastic Gradient Boosting 
## 
## 27752 samples
##    58 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 24977, 24978, 24977, 24977, 24977, 24976, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  RMSE      Rsquared    MAE     
##   1                   50      10318.05  0.02681133  3021.410
##   1                  100      10316.37  0.02735201  3016.105
##   1                  150      10314.34  0.02796323  3009.380
##   2                   50      10393.08  0.01825524  3042.927
##   2                  100      10440.35  0.01694738  3067.113
##   2                  150      10485.55  0.01471246  3086.521
##   3                   50      10419.72  0.01716931  3054.683
##   3                  100      10500.70  0.01444777  3083.653
##   3                  150      10544.39  0.01329798  3108.980
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 150,
##  interaction.depth = 1, shrinkage = 0.1 and n.minobsinnode = 10.
proc.time()-startTimeModule
##    user  system elapsed 
## 211.797   0.462 214.440
email_notify(paste("Stochastic Gradient Boosting Modeling Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@59a6e353}"

4.d) Compare baseline algorithms

results <- resamples(list(LR=fit.lm, RIDGE=fit.ridge, LASSO=fit.lasso, EN=fit.en, CART=fit.cart, kNN=fit.knn, SVM=fit.svm, BagCART=fit.bagcart, RF=fit.rf, GBM=fit.gbm))
summary(results)
## 
## Call:
## summary.resamples(object = results)
## 
## Models: LR, RIDGE, LASSO, EN, CART, kNN, SVM, BagCART, RF, GBM 
## Number of resamples: 10 
## 
## MAE 
##             Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## LR      2853.113 2905.297 3009.479 3038.825 3113.773 3457.176    0
## RIDGE   2854.769 2903.960 3003.489 3013.986 3100.693 3241.716    0
## LASSO   2853.107 2905.297 3009.483 3016.310 3113.764 3232.026    0
## EN      2850.611 2901.678 3005.177 3015.345 3110.871 3252.015    0
## CART    2963.559 3010.490 3128.610 3130.078 3234.319 3355.742    0
## kNN     3041.168 3100.604 3221.046 3193.897 3272.262 3336.732    0
## SVM     2273.383 2383.040 2494.444 2482.555 2586.270 2702.165    0
## BagCART 2876.399 2956.958 3087.981 3076.994 3157.194 3369.829    0
## RF      2935.290 2975.430 3048.084 3080.526 3170.419 3330.873    0
## GBM     2836.234 2901.780 3010.330 3009.380 3093.273 3241.510    0
## 
## RMSE 
##             Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## LR      6473.397 7475.927 9353.791 10687.47 13453.67 19391.85    0
## RIDGE   6468.889 7476.762 9346.474 10324.02 13450.79 15723.62    0
## LASSO   6473.386 7475.922 9353.787 10319.86 13453.67 15715.82    0
## EN      6471.524 7474.360 9352.440 10323.10 13452.46 15757.94    0
## CART    6553.440 7633.558 9437.157 10420.00 13528.36 15792.26    0
## kNN     7479.592 8416.436 9996.430 10934.61 13835.92 15992.46    0
## SVM     6436.948 7591.257 9435.654 10399.65 13555.28 15780.88    0
## BagCART 6477.420 7808.695 9402.462 10437.36 13561.09 15818.12    0
## RF      6419.987 7457.506 9319.994 10302.77 13451.09 15690.82    0
## GBM     6401.011 7500.555 9327.371 10314.34 13468.91 15713.85    0
## 
## Rsquared 
##                 Min.     1st Qu.      Median        Mean     3rd Qu.
## LR      0.0001445875 0.013447905 0.028670520 0.025893255 0.039248461
## RIDGE   0.0089961365 0.014125860 0.028216534 0.025897280 0.037897304
## LASSO   0.0103712171 0.013448496 0.028671112 0.026916610 0.039249848
## EN      0.0067949048 0.013530877 0.028927559 0.026695448 0.039563522
## CART    0.0017896513 0.004937158 0.006535587 0.006980293 0.008578722
## kNN     0.0003029152 0.001519501 0.002782245 0.002877122 0.003515545
## SVM     0.0094630092 0.016332825 0.019098627 0.026228272 0.033162854
## BagCART 0.0024898907 0.007728587 0.011540669 0.011829343 0.015565381
## RF      0.0123985380 0.015550553 0.031066597 0.030861831 0.045613953
## GBM     0.0106486418 0.012580137 0.028895566 0.027963229 0.039438171
##                Max. NA's
## LR      0.045714336    0
## RIDGE   0.041662828    0
## LASSO   0.045714326    0
## EN      0.045772593    0
## CART    0.013060346    6
## kNN     0.006195153    0
## SVM     0.068866081    0
## BagCART 0.026884271    0
## RF      0.048590283    0
## GBM     0.048644403    0
dotplot(results)

cat('The average RMSE from all models is:',
    mean(c(results$values$`LR~RMSE`, results$values$`RIDGE~RMSE`, results$values$`LASSO~RMSE`, results$values$`EN~RMSE`, results$values$`CART~RMSE`, results$values$`kNN~RMSE`, results$values$`SVM~RMSE`, results$values$`BagCART~RMSE`, results$values$`RF~RMSE`, results$values$`GBM~RMSE`)))
## The average RMSE from all models is: 10446.32
email_notify(paste("Baseline Modeling Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@2812cbfa}"

5. Improve Accuracy or Results

After we arrive at a short list of machine learning algorithms with a good level of accuracy, we can look for ways to improve the accuracy of the models.

Using the two best-performing algorithms from the previous section, we will search for a combination of parameters for each algorithm that yields the best results.

5.a) Algorithm Tuning

Finally, we will tune the best-performing algorithms from each group further and see whether we can get more accuracy out of them.

# Tuning algorithm #1 - Random Forest
startTimeModule <- proc.time()
set.seed(seedNum)
grid <- expand.grid(mtry = c(1:4))
fit.final1 <- train(targetVar~., data=xy_train, method="rf", metric=metricTarget, tuneGrid=grid, trControl=control)
plot(fit.final1)

print(fit.final1)
## Random Forest 
## 
## 27752 samples
##    58 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 24977, 24978, 24977, 24977, 24977, 24976, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared    MAE     
##   1     10307.52  0.03043831  3029.097
##   2     10299.56  0.03128399  3081.226
##   3     10316.41  0.03003893  3114.383
##   4     10340.40  0.02770603  3145.916
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 2.
proc.time()-startTimeModule
##      user    system   elapsed 
## 11578.295    33.725 11741.247
email_notify(paste("Algorithm #1 Tuning Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@4411d970}"
# Tuning algorithm #2 - Stochastic Gradient Boosting
startTimeModule <- proc.time()
set.seed(seedNum)
grid <- expand.grid(.n.trees=c(50,100,150,200), .shrinkage=0.1, .interaction.depth=c(1,2), .n.minobsinnode=10)
fit.final2 <- train(targetVar~., data=xy_train, method="gbm", metric=metricTarget, tuneGrid=grid, trControl=control, verbose=F)
plot(fit.final2)

print(fit.final2)
## Stochastic Gradient Boosting 
## 
## 27752 samples
##    58 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 24977, 24978, 24977, 24977, 24977, 24976, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  RMSE      Rsquared    MAE     
##   1                   50      10314.36  0.02731590  3023.879
##   1                  100      10311.88  0.02797710  3013.768
##   1                  150      10314.66  0.02781298  3012.772
##   1                  200      10316.04  0.02782426  3014.574
##   2                   50      10379.25  0.01927231  3039.096
##   2                  100      10419.13  0.01767183  3054.236
##   2                  150      10472.71  0.01534379  3083.076
##   2                  200      10535.51  0.01316090  3105.615
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 100,
##  interaction.depth = 1, shrinkage = 0.1 and n.minobsinnode = 10.
proc.time()-startTimeModule
##    user  system elapsed 
## 150.712   0.411 152.756
email_notify(paste("Algorithm #2 Tuning Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@380fb434}"

5.b) Compare Algorithms After Tuning

results <- resamples(list(RF=fit.final1, GBM=fit.final2))
summary(results)
## 
## Call:
## summary.resamples(object = results)
## 
## Models: RF, GBM 
## Number of resamples: 10 
## 
## MAE 
##         Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## RF  2921.914 2991.992 3052.158 3081.226 3173.152 3330.654    0
## GBM 2844.515 2901.300 3014.546 3013.768 3102.411 3253.453    0
## 
## RMSE 
##         Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## RF  6432.928 7463.356 9324.177 10299.56 13432.92 15674.88    0
## GBM 6396.169 7496.137 9325.899 10311.88 13459.53 15709.52    0
## 
## Rsquared 
##           Min.    1st Qu.     Median       Mean    3rd Qu.       Max. NA's
## RF  0.01180443 0.01788671 0.03053476 0.03128399 0.04240005 0.05841516    0
## GBM 0.01033234 0.01340071 0.02969721 0.02797710 0.04000796 0.04874818    0
dotplot(results)

6. Finalize Model and Present Results

Once we have narrowed down to a model that we believe can make accurate predictions on unseen data, we are ready to finalize it. Finalizing a model may involve sub-tasks such as:

6.a) Predictions on validation dataset

predictions <- predict(fit.final1, newdata=xy_test)
print(RMSE(predictions, y_test))
## [1] 12978.33
print(R2(predictions, y_test))
## [1] 0.02308703
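As a sanity check, caret's `RMSE()` and `R2()` can be reproduced by hand from their definitions. A sketch, assuming `predictions` and `y_test` are numeric vectors of the same length (caret's default `R2` is the squared correlation between predicted and observed values):

```r
# Root mean squared error: mean of squared residuals, then square root
rmse_manual <- sqrt(mean((predictions - y_test)^2))
# R-squared as the squared Pearson correlation of predicted vs. observed
r2_manual <- cor(predictions, y_test)^2
```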

6.b) Create standalone model on entire training dataset

startTimeModule <- proc.time()
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:Biobase':
## 
##     combine
## The following object is masked from 'package:BiocGenerics':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
set.seed(seedNum)
finalModel <- randomForest(targetVar~., data=xy_train, mtry=2)
summary(finalModel)
##                 Length Class  Mode     
## call                4  -none- call     
## type                1  -none- character
## predicted       27752  -none- numeric  
## mse               500  -none- numeric  
## rsq               500  -none- numeric  
## oob.times       27752  -none- numeric  
## importance         58  -none- numeric  
## importanceSD        0  -none- NULL     
## localImportance     0  -none- NULL     
## proximity           0  -none- NULL     
## ntree               1  -none- numeric  
## mtry                1  -none- numeric  
## forest             11  -none- list     
## coefs               0  -none- NULL     
## y               27752  -none- numeric  
## test                0  -none- NULL     
## inbag               0  -none- NULL     
## terms               3  terms  call
proc.time()-startTimeModule
##    user  system elapsed 
## 340.438   0.465 344.670

6.c) Save model for later use

#saveRDS(finalModel, "./finalModel_Regression.rds")
proc.time()-startTimeScript
##     user   system  elapsed 
## 75103.79    72.67 75997.47
email_notify(paste("Model Validation and Final Model Creation Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@42d3bd8b}"